# Parallel convolution processing using an integrated photonic tensor core

J. Feldmann<sup>1,\*</sup>, N. Youngblood<sup>2,3,\*</sup>, M. Karpov<sup>4,\*</sup>, H. Gehring<sup>1</sup>, X. Li<sup>2</sup>, M. Le Gallo<sup>5</sup>, X. Fu<sup>4</sup>, A. Lukashchuk<sup>4</sup>, A.S. Raja<sup>4</sup>, J. Liu<sup>4</sup>, C.D. Wright<sup>6</sup>, A. Sebastian<sup>5,#</sup>, T.J. Kippenberg<sup>4,#</sup>, W.H.P. Pernice<sup>1,7,#</sup> and H. Bhaskaran<sup>2,#</sup>

<sup>#</sup>Correspondence to: wolfram.pernice@uni-muenster.de, harish.bhaskaran@materials.ox.ac.uk, kippenberg@epfl.ch, ASE@zurich.ibm.com

<sup>1</sup> Institute of Physics, University of Muenster, Heisenbergstr. 11, 48149 Muenster, Germany

<sup>2</sup> Department of Materials, University of Oxford, Parks Road, OX1 3PH Oxford, UK

<sup>3</sup> Department of Electrical and Computer Engineering, University of Pittsburgh, 3700 O'Hara St., Pittsburgh, PA 15261, USA

<sup>4</sup> Laboratory of Photonics and Quantum Measurements, Swiss Federal Institute of Technology Lausanne (EPFL), Station 3, CH-1015, Lausanne, Switzerland

<sup>5</sup> IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland

<sup>6</sup> Department of Engineering, University of Exeter, Exeter, EX4 4QF, UK

<sup>7</sup> Center for Soft Nanoscience, University of Münster, 48149 Münster, Germany

<sup>\*</sup> These authors contributed equally.

With the proliferation of ultra-high-speed mobile networks and internet-connected devices, along with the rise of artificial intelligence, the world is generating exponentially increasing amounts of data that need to be processed in a fast, efficient and 'smart' way. These developments are pushing the limits of existing computing paradigms, and highly parallelized, fast and scalable hardware concepts are becoming progressively more important. Here, we demonstrate a computationally specific integrated photonic tensor core (the optical analog of an ASIC) capable of operating at Tera-Multiply-Accumulate per second (TMAC/s) speeds. The photonic core achieves parallelized photonic in-memory computing using phase-change memory arrays and photonic chip-based optical frequency combs (soliton microcombs). The computation is reduced to measuring the optical transmission of reconfigurable and non-resonant, i.e. broadband, passive components operating at a bandwidth exceeding 14 GHz, limited only by the speed of the modulators and photodetectors. Given recent advances in hybrid integration of soliton microcombs at microwave line rates, ultra-low loss silicon nitride waveguides, and high speed on-chip detectors and modulators, our approach provides a path towards full CMOS wafer-scale integration of the photonic tensor core. While we focus on convolution processing, more generally our results indicate the major potential of integrated photonics for parallel, fast, efficient and wafer-scale manufacturable computational hardware in demanding AI applications such as autonomous driving, live video processing, and next generation cloud computing services.

The increased demand for machine learning on very large datasets<sup>1</sup> and the growing offering of artificial intelligence services on the cloud<sup>2-4</sup> have driven a resurgence in custom hardware designed to accelerate multiply and accumulate (MAC) computations, the fundamental mathematical element needed for matrix-vector multiplication (MVM) operations. Whilst various custom silicon computing hardware (i.e. FPGAs<sup>5</sup>, ASICs<sup>6</sup>, and GPUs<sup>7</sup>) have been developed to improve computational throughput and efficiency, they still depend on the same underlying electrical components, which are fundamentally limited in both speed and energy by Joule heating, RF crosstalk, and capacitance<sup>8</sup>. The last of these (capacitance) dominates energy consumption and limits the maximum operating speeds in neural network hardware accelerators<sup>9</sup>, since the movement of data (e.g. trained network weights), rather than arithmetic operations, requires the charging and discharging of chip-level metal interconnects. Thus, improving the efficiency of logic gates at the device level provides diminishing returns in such applications if the flow of data during computation is not simultaneously addressed<sup>10</sup>. Even recent developments in the use of memristive crossbar arrays<sup>11–13</sup> to compute in the analog domain, whilst promising, do not have the potential for parallelizing the MVM operations (save for physically replicating the elements of the matrix). Moreover, they are plagued by the same limitations of electronic addressing<sup>14</sup>, with additional challenges in manufacturing and implementation due to issues with device variability<sup>15,16</sup>, cyclability<sup>17</sup>, and drift<sup>18,19</sup>.

Integrated photonics benefits from the same modularity and scalable fabrication methods of integrated circuits, but has two key advantages over its electronic counterparts: (1) massively parallel data transfer through wavelength division multiplexing (WDM) in conjunction with multichannel sources (i.e. optical frequency combs); and (2) extremely high data modulation speeds limited only by the bandwidth of on-chip optical modulators and photodetectors. These uniquely photonic advantages have led to the ubiquity of optical networks for information transfer, and are presently revolutionizing data centre interconnects (i.e. server-to-switch communication). However, these developments have yet to seriously challenge digital electronics in the arena of information processing. Despite the current dominance of integrated electronics for computing, an application-specific optical processor not limited by the energy-bandwidth trade-off of electrical interconnects<sup>8</sup> could bring the advantages of optical networking to the field of computing. This would result in very high computational throughput via low-latency (i.e. information processing and propagation at light speed) and parallel operations in a single physical optical processing core using WDM.

However, for this to be practically realised, photonic integration and the use of CMOS compatible manufacturing are of paramount importance: on chip, both energy-efficient optical memory units and a compact, broadband multi-channel laser source must be combined within a scalable photonic architecture. Recent work on integrated photonic processors for MVMs and neuromorphic computing<sup>20–22</sup> has shown the potential advantages of the photonic approach, but key issues such as large footprints (11,000 µm² per interferometer unit<sup>20</sup>) and the use of thermo-optic heaters to tune the phase or resonance wavelength of their components (ranging on average from 1 mW to 10 mW per heater for ring resonators and Mach-Zehnder interferometers respectively) were bottlenecks<sup>23</sup>, as was the fact that resonant devices such as add-drop resonators limit the modulation bandwidth. Additionally, while using WDM for processing multiple inputs simultaneously in the same physical hardware has been proposed<sup>24</sup>, it has not yet been demonstrated on-chip.

Here, we design and experimentally demonstrate a novel scalable, CMOS compatible, photonic hardware accelerator (which we term "photonic tensor core" in the following) capable of many parallel MVM operations at optical data rates to process images using convolution filters (here, edge detection and emboss filters). In a departure from electronic accelerators (see Fig. 1a), our photonic processor implements an on-chip matrix multiplication engine capable of performing parallel multiply-accumulate operations using multiple wavelengths, derived from a photonic chip-based optical frequency comb, that are incoherently added within a network of waveguides that exploit phase-change materials. We leverage recent advances in chip-scale microcombs<sup>25,26</sup> operating in the regime of dissipative Kerr soliton (DKS) states, which enable broadband, low-noise, and fully integrated optical frequency combs with line spacing ranging from GHz to THz domains and that are compatible with wafer scale manufacturing and integration with on-chip lasers<sup>27–29</sup>. These devices have already been employed in system level demonstrations such as massively parallel coherent communications<sup>30</sup>, chip-scale frequency synthesizers<sup>31</sup>, and massively parallel LiDAR<sup>32</sup>.

Key to our approach is the encoding of image data onto the individual comb teeth of an on-chip frequency comb, and subsequently encoding fixed convolutional kernels in the non-volatile configuration (i.e. the amorphous or crystalline phase) of integrated phase-change material cells that couple evanescently to a matrix of interconnected photonic waveguides (shown in Fig. 1c). Our approach minimizes both latency and the movement of data by using non-volatile in-memory photonic MAC operations, and greatly reduces the footprint cost of photonics by multiplexing computations in the same photonic core. Importantly, both the soliton microcombs and the matrix of photonic waveguides can be implemented in silicon nitride<sup>33</sup>, an ultra-low loss, CMOS compatible nonlinear integrated photonic platform that is compatible with wafer scale manufacturing and foundry processes. Combined with recent advances in both on-chip modulators and hybrid integration of soliton microcombs<sup>27,28</sup>, fully integrated custom photonic tensor cores are now a viable proposition.

#### Realization of parallel 2D convolutions via matrix-vector multiply operations

One prominent class of machine learning that stands to gain much from high throughput accelerators is convolutional neural networks (CNNs), which are highly effective for applications such as image classification, autonomous navigation, and audio analysis in the frequency domain. In state-of-the-art CNNs, many convolution "hidden layers" are applied to an input signal before feeding the processed data to fully connected layers for classification<sup>34,35</sup>. Each of the convolution layers takes in an input image, performs convolutional operations (with a 'filter') to extract features, and generates an output image. To perform each convolution operation, a filter is passed over the input image, inspecting a small window of pixels at a time. A pixel-wise MAC operation between the filter and the current window is carried out to calculate a single pixel of the output image. For the case of a convolution between an input image of dimension $n \times n$ with $d_{in}$ channels and a filter of dimension $k \times k \times d_{in}$, the resulting output image is of dimension $(n-k+1)\times(n-k+1)$. In CNNs, $d_{out}$ convolution kernels will be applied to the same image, which corresponds to $(n-k+1)^2 \times k^2 \times d_{in} \times d_{out}$ MAC operations per convolution layer. When performing these operations in the digital domain, a minimum of two clock cycles is typically required for each sequential MAC operation, leading to a significant computational bottleneck and requiring distribution across multiple computing cores, as illustrated in Fig. 1a.
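The MAC count stated above can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the example layer size (a 28×28 single-channel image with four 3×3 kernels) are assumptions chosen for the example, not values taken from the experiments.

```python
def conv_output_size(n: int, k: int) -> int:
    """Side length of the output image for an n x n input and a k x k filter
    (no padding, stride 1, as in the formula in the text)."""
    return n - k + 1


def conv_layer_macs(n: int, k: int, d_in: int, d_out: int) -> int:
    """Number of MAC operations per convolution layer:
    (n - k + 1)^2 * k^2 * d_in * d_out."""
    return conv_output_size(n, k) ** 2 * k * k * d_in * d_out


if __name__ == "__main__":
    # Hypothetical layer, chosen only for illustration.
    print(conv_output_size(28, 3))        # 26
    print(conv_layer_macs(28, 3, 1, 4))   # 24336
```

The count grows quadratically with the image side length and linearly with the number of kernels, which is why sequential execution of these operations becomes a bottleneck.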

In order to build efficient hardware to perform the convolution operations, one approach (originally conceived for electronic in-memory computing using memristive crossbar arrays<sup>36,37</sup>) is to combine all the convolutional filters into a large filter matrix stored in memory.

As depicted in Fig. 1b, such a filter matrix will be of dimension $(k^2 \times d_{in}) \times d_{out}$. It is constructed by stacking the kernel matrices into the columns of the final filter matrix. In the same way, the pixels of the input image are rearranged by stacking the pixels of the filter volume, $(k \times k \times d_{in})$, into the rows of the input matrix. Hence a single convolution operation involves $(n-k+1)^2$ MVM operations between the filter matrix and the input vectors of dimension $k^2 \times d_{in}$. In the electronic domain, these MVM operations are typically multiplexed in time, with parallelization afforded only by physically replicating the filter matrix. In this work, we exploit a photonic integrated soliton microcomb and optical WDM to overcome this fundamental limitation by encoding multiple input vectors of dimension $k^2 \times d_{in}$ onto multiple lines of a coherent chip-scale frequency comb (see Fig. 1c). These optical input vectors can then be applied to a single $(k^2 \times d_{in}) \times d_{out}$ filter matrix simultaneously, thus eliminating duplicated physical hardware and sequential operations. This approach will be employed when designing the photonic tensor core.
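The rearrangement described above can be sketched in plain Python for the single-channel case ($d_{in} = 1$). The helper names (`im2col`, `filter_matrix`, `mvm`, `conv_via_mvm`) are hypothetical, and the code is a software illustration of the mapping rather than the authors' implementation. As is usual for CNNs, the kernel is applied without flipping.

```python
def im2col(image, k):
    """Flatten every k x k window of a square single-channel image into a row.
    Returns (n - k + 1)^2 rows of length k*k."""
    n = len(image)
    out = n - k + 1
    return [
        [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
        for r in range(out)
        for c in range(out)
    ]


def filter_matrix(kernels):
    """Stack the flattened kernels as columns: shape (k*k) x d_out."""
    flat = [[v for row in kern for v in row] for kern in kernels]
    return [list(col) for col in zip(*flat)]


def mvm(vec, mat):
    """Vector-matrix product: one inner product per matrix column."""
    return [sum(x * mat[i][j] for i, x in enumerate(vec))
            for j in range(len(mat[0]))]


def conv_via_mvm(image, kernels):
    """One MVM per output pixel; each MVM yields that pixel for all d_out kernels."""
    k = len(kernels[0])
    fmat = filter_matrix(kernels)
    return [mvm(window, fmat) for window in im2col(image, k)]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernels = [[[1, 0], [0, -1]], [[0, 1], [1, 0]]]
    print(conv_via_mvm(image, kernels))
```

Each entry of the returned list corresponds to one output pixel position and holds the values for all kernels at once, which is the property the filter matrix exploits.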

#### The photonic tensor core

First, we demonstrate how to perform an MVM operation in the optical domain using photonic integrated circuits employing non-volatile phase-change cells that store analog values of the matrix *in situ*<sup>38</sup>. Details of using phase-change materials (PCMs) on single devices are described elsewhere<sup>38,39</sup>. In this work, the PCM (Ge<sub>2</sub>Sb<sub>2</sub>Te<sub>5</sub> or GST) cells are employed as attenuating matrix elements which absorb a desired amount of light depending on their particular phase configuration. In the crystalline PCM state, most of the incoming light is absorbed, representing for example a "0". In the amorphous state, most of the light is transmitted, thus representing a "1". Intermediate transmission states can be chosen by controllably switching fractions of amorphous and crystalline parts in the PCM cell<sup>38,40</sup>. To achieve both positive and negative matrix elements, here we define "0" as an intermediate state between the crystalline and amorphous states, as described in the Supplementary Information.

In order to calculate the $i \times j$ MVM operation shown at the top of Fig. 2a, the input vector is encoded in the amplitude of the optical signals sent to the different matrix inputs. In addition to amplitude at a given wavelength, the input vector is also encoded at different wavelengths, providing the ability for multiple calculations to be carried out simultaneously. The amplitude of each wavelength represents one of the vector entries $(X_1, \ldots, X_i)$. Therefore, the input vectors can be fed to the matrix by modulating the input wavelengths with currently available fast electro-optical modulators, providing access to very high data rates. The matrix itself is designed as a waveguide crossbar array with additional directional couplers that equally distribute the input power to all PCM cells (more details of the splitting ratios of the directional couplers are given in the supplementary information). By using a soliton microcomb with a mode spacing that exceeds the detector bandwidth, interference inside the waveguides can be avoided, and the summation of the individual products (of the matrix-vector multiplications) can be performed by adding the comb teeth to the output waveguides, also by using directional couplers. With the horizontal directional couplers, the input vectors are equally distributed to the different columns of the matrix (which represent the individual image kernels), whereas the vertical directional couplers combine the input light after interaction with the PCM cells and perform the accumulation operation. It should be noted that each vector entry only interacts with a single PCM cell per matrix column. This interaction can be viewed as a single multiplication between the incoming amplitude and the absorption of the phase-change cell, as has been shown in previous work<sup>41</sup>. The output power at each column of the matrix finally represents an inner product (the sum of the individual products) of the input vector with a kernel, multiplied by a fixed factor $\frac{1}{ij}$ which depends on the matrix size. Power distribution due to fan-out accounts for the $1/j$ loss, while combining $i$ non-interfering sources with directional couplers accounts for the additional $1/i$ loss due to energy conservation.
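The input-output relation described above can be written as a small idealized model. The sketch below assumes lossless couplers, perfectly equal splitting, and PCM cells described by a power transmission between 0 and 1; the function name `crossbar_output` and the example values are illustrative assumptions, not measured data.

```python
def crossbar_output(x, transmission):
    """Idealized output power per matrix column.

    x            : input powers X_1..X_i, one per matrix input (each on its own wavelength)
    transmission : i x j matrix of PCM cell transmissions in [0, 1]

    Each column output is the incoherent sum of the products X_r * T[r][c],
    scaled by 1/(i*j): 1/j from fan-out to the j columns and 1/i from
    combining i non-interfering sources.
    """
    n_in = len(transmission)
    n_out = len(transmission[0])
    scale = 1.0 / (n_in * n_out)
    return [
        scale * sum(x[r] * transmission[r][c] for r in range(n_in))
        for c in range(n_out)
    ]


if __name__ == "__main__":
    # Hypothetical 2 x 2 example.
    x = [1.0, 1.0]
    t = [[1.0, 0.5],
         [1.0, 0.5]]
    print(crossbar_output(x, t))  # [0.5, 0.25]
```

Because the scale factor is fixed by the matrix size, it can be removed by a constant rescaling of the detected powers.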

Figure 2b depicts a scanning electron micrograph of the resonator used for comb generation, whereas Fig. 2c shows an optical micrograph of a fabricated 4×4 matrix. Key chip regions are magnified in the scanning electron micrographs on the right. Coupling of light into the optical chip is achieved using broadband total internal reflection (TIR) couplers<sup>42,43</sup> (bottom right of Fig. 2c). The TIR couplers provide access to a wide wavelength spectrum and thus allow the coupling of multiple wavelengths into the chip. The PCM cells (of area 3×3 μm²) acting as the matrix elements are deposited on top of waveguide crossings (Fig. 2c top right). Each individual matrix cell has three additional grating couplers used to optically address the PCM. By sending pulses (via the middle coupler) to the waveguide directly leading to the PCM cell on the crossing, the cell can be optically switched to program each matrix element. In this case the light is coupled to the chip using Bragg-grating couplers, because operation at a single wavelength (1550 nm) is sufficient.

In addition to substantial benefits in modulation speed (for changing the vector inputs), an optical implementation of a matrix-vector multiplier allows the harnessing of wavelength division multiplexing (WDM) to execute parallel MVM operations. In particular, as Fig. 2d explains, the same matrix can be used to process several input vectors at the same time when all the individual vectors are encoded on different wavelengths. For the 4×4 matrix example shown in Fig. 2, and the processing of four input vectors per time step, sixteen different wavelengths are needed. In this work, these wavelengths are generated using a single DKS state of a microcomb<sup>26,44,45</sup>, which is fed into a demultiplexer to split up the individual wavelengths ($\lambda_1$ to $\lambda_{16}$). After manipulating the amplitude of each comb line individually (according to the value of the input vectors) by using variable optical attenuators (VOAs), the corresponding entries of each vector are multiplexed back together (e.g. $\lambda_1$, $\lambda_5$, $\lambda_9$, $\lambda_{13}$) and sent to the matrix input. After propagating through the filter matrix, all output waveguides of the matrix contain all 16 input wavelengths. Proper demultiplexing and combining of the wavelengths corresponding to the individual vectors yields the convolution results that can be measured with photodetectors. In the current example, 16 inner-product operations (four kernels applied to four input vectors) are carried out in a single time step. Depending on the number of lines available in the frequency comb, the multiplexing scheme can be extended further, leading to significant speed gains. Figure 2e shows the optical spectrum of an on-chip microcomb, revealing lines with 100 GHz spacing over a range of more than 25 THz.
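The multiplexing scheme can be illustrated with a short simulation. The wavelength assignment used here (vector $v$ occupies comb lines $4v+1$ to $4v+4$, so that the first entries of the four vectors sit on $\lambda_1$, $\lambda_5$, $\lambda_9$, $\lambda_{13}$) is an assumption consistent with the example in the text; the function names and the idealized $1/(ij)$ scaling are likewise illustrative.

```python
def wavelength_index(vector, entry, n_entries):
    """1-based comb line carrying `entry` of input vector `vector` (assumed assignment)."""
    return vector * n_entries + entry + 1


def port_wavelengths(entry, n_vectors, n_entries):
    """Comb lines multiplexed together onto the matrix input for a given entry."""
    return [wavelength_index(v, entry, n_entries) for v in range(n_vectors)]


def parallel_mvm(vectors, transmission):
    """Apply one transmission matrix to several wavelength-encoded vectors at once."""
    n_in, n_out = len(transmission), len(transmission[0])
    scale = 1.0 / (n_in * n_out)
    # Power of every comb line at every output column: all columns carry all lines.
    per_line = {}
    for v, x in enumerate(vectors):
        for e, power in enumerate(x):
            line = wavelength_index(v, e, n_in)
            per_line[line] = [scale * power * transmission[e][c] for c in range(n_out)]
    # Demultiplex: sum only the lines that belong to the same input vector.
    results = []
    for v in range(len(vectors)):
        lines = [wavelength_index(v, e, n_in) for e in range(n_in)]
        results.append([sum(per_line[l][c] for l in lines) for c in range(n_out)])
    return results


if __name__ == "__main__":
    print(port_wavelengths(0, 4, 4))  # [1, 5, 9, 13]
    t = [[1.0, 0.5], [0.25, 1.0]]
    print(parallel_mvm([[1.0, 0.0], [0.0, 1.0]], t))
```

In this picture the matrix is shared by all vectors and only the demultiplexing stage distinguishes them, so the number of parallel MVMs scales with the number of usable comb lines.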

To illustrate the principle outlined above experimentally, the convolution of an input depicting a handwritten "4" (Fig. 3a)<sup>46</sup> is performed using four 3×3 image kernels (resulting in a $9\times4$ filter matrix) and a single vector ($9\times1$) per time step (Fig. 3b-e). Note that $d_{in} = 1$ and $d_{out} = 4$ in this example. The image kernels applied in this example are chosen for edge detection and are shown below the output images (for more details on how the matrix elements are exactly defined in the PCM state, see supplementary information). After obtaining the results of the matrix-vector multiplications, the output values are offset by +0.5; values below 0 are then set to 0 (black pixel) and values above 1 are set to 1 (white pixel). Each of the kernels highlights different edges of the original image: Fig. 3b, for example, highlights upper edges, whereas Fig. 3d brings out the opposite lower edges. Since the four kernels are all inscribed in the same matrix, one pixel value for each of the four output images is obtained simultaneously, totalling more than 63,000 inner-product operations to process the entire image. The edge features are strongly visible, which emphasizes the effectiveness of our optical convolution operation. The variation in the background is due to power fluctuations of the comb lines over time, leading to small errors in the matrix-vector multiplications. It should be noted that in the examples of Fig. 3b-e, for each optical matrix-vector multiplication, an MVM operation in software is also performed in a post-processing step to subtract a certain reference power from the measured output power in the matrix columns (more details are provided in the supplementary information). In order to avoid this post-processing, the reference convolution operation can also be performed optically in the same on-chip matrix. In this case, one matrix column is left in a reference state (for example, all PCM cells in the crystalline phase state). The output value from this column is then subtracted from all the matrix columns holding the actual image kernels. Figure 3f shows an experimental example of a convolution operation which was performed without electrical post-processing using reference subtraction. Here, a $3\times3$ kernel (emboss filter) was applied using a $9\times2$ matrix, with one column for the image kernel and one column for the reference. The original image is shown on the left, while the experimental output image after the convolution operation is shown in the middle panel. From comparison with the calculated expected output on the right, it can be seen that the on-chip matrix also performs well without the need for the post-processing step. It should be noted that even though the image has three color channels, red, green and blue ($d_{in} = 3$), the convolutions are performed on each channel independently and combined at the end to form the output image; however, this is more a limitation of the size of our hardware matrix than a fundamental limitation of this technology.
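The two processing steps described above (reference subtraction and mapping of the result to a displayable pixel value) can be sketched as follows. The function names are hypothetical and the numbers in the example are made up for illustration; the exact definition of the reference power is given in the supplementary information and is not reproduced here.

```python
def subtract_reference(kernel_outputs, reference_output):
    """Subtract the output of the reference column from every kernel column."""
    return [p - reference_output for p in kernel_outputs]


def to_pixel(value, offset=0.5):
    """Offset the MVM result by +0.5 and clip it to [0, 1]:
    values below 0 become black (0), values above 1 become white (1)."""
    return min(1.0, max(0.0, value + offset))


if __name__ == "__main__":
    # Hypothetical column outputs for one time step (arbitrary units).
    signed = subtract_reference([0.75, 0.25, 0.5], 0.5)
    print(signed)                          # [0.25, -0.25, 0.0]
    print([to_pixel(v) for v in signed])   # [0.75, 0.25, 0.5]
```

A result of zero therefore maps to mid-grey, which is why flat image regions appear as a uniform grey background in edge-detection outputs of this kind.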

Having demonstrated the basic capabilities of our phase-change nanophotonic approach to performing convolution operations in the optical domain, we now show, in Fig. 4, experimental examples of processing four input vectors in parallel. In this case, four pixels of the new image are obtained per image kernel simultaneously, therefore shortening the processing time by a factor of four. The kernel size used for this experiment is $2\times2$ and the input dimension of the image is $d_{in}=1$, leading to a $4\times4$ filter matrix. The convolutions again highlight different edges, which can be clearly seen, for example, in the representation of the bricks in the upper image. Panels b) and e) emphasize vertical edges, whereas panels c) and d) highlight horizontal edges. This is in spite of variations in the vertical direction due to power fluctuations of the input signal, underlining the robustness of the technique.

#### Projections to the future

The above data was obtained with matrices up to a size of $9\times4$, with a maximum of four input vectors per time step and a relatively slow modulation speed resulting from using the variable optical attenuators (approximately 1 kHz). To estimate the ultimate performance capabilities of the system, we now operate the tensor processor using high speed electro-optical modulators and multiple comb lines. Because the photonic system is designed with broadband input couplers and broadband directional couplers in silicon nitride with a wide optical transparency window, the tensor processor supports many comb lines from the frequency comb source. Figure 5a shows the optical spectrum of the frequency comb after transmission through the matrix, with lines in a wide range of over 100 nm with a spacing of 100 GHz, thus providing access to more than 200 individual wavelengths. The inset depicts a zoom into the 16 frequency lines that were used throughout the experiments discussed above. Besides the spectral width of the frequency comb, the influence of wavelength-dependent parts in the matrix design also has to be considered when estimating the wavelength range exploitable for the calculations. In this case, it is especially the wavelength dependence of the directional couplers that hinders the equal distribution of the input power for all wavelengths. The simple design of the couplers applied here still offers a range of approximately 100 nm (see supplementary information) but can be considerably improved by an adapted design<sup>47</sup>.

Each of the comb lines can be used for encoding vector values by setting the power with electro-optical modulators. Figure 5b shows the frequency response of the matrix for modulation frequencies up to 14 GHz. The data illustrates the influence of the matrix on each input and was obtained by first characterizing the complete setup and then subtracting the frequency response of the setup with the matrix excluded. The flat response shows that the matrix only acts as a passive element during the convolution operations and does not limit the operational speed. In this experiment, the maximum frequency was determined by the photodiode, which is specified (-3 dB bandwidth) up to 12 GHz. The inset shows an open eye diagram at a rate of 13.5 GHz. Thus, considering a 9×4 matrix, four multiplexed input vectors and a modulation speed of 14 GHz, a processing speed of 2 TMAC/s (9×4 MACs × 4 input vectors × 14 GHz) can be obtained. This, however, is not the ultimate speed, since we are limited here by the modulation and detection bandwidth of our particular experimental setup.
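The throughput figure quoted above follows directly from the product of matrix size, number of multiplexed vectors and modulation rate. A minimal sketch of that arithmetic (the function name is an assumption introduced for illustration):

```python
def throughput_mac_per_s(rows, cols, n_vectors, rate_hz):
    """MAC operations per second: matrix elements x parallel vectors x modulation rate."""
    return rows * cols * n_vectors * rate_hz


if __name__ == "__main__":
    macs = throughput_mac_per_s(rows=9, cols=4, n_vectors=4, rate_hz=14e9)
    print(f"{macs / 1e12:.3f} TMAC/s")  # 2.016 TMAC/s
```

The same expression makes explicit that the throughput scales linearly with each of the three factors.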

To analyze the accuracy of the optical convolution processor, randomly chosen input vectors with nine entries are processed using a fixed matrix and are compared to the expected, analytically calculated multiplication result. The results for 100,000 calculations are scaled to the range [0,1] and plotted in Fig. 5c; the inset shows the corresponding histogram, revealing a standard deviation of 0.008. Figure 5d shows the calculated standard deviation for a fixed-point scalar multiplication as a function of the precision of the input values. From the intersection with the horizontal line indicating the experimentally found standard deviation, a resolution of 5 bits can be identified. The resolution is determined by the repeatability of the modulator settings, the noise of the detectors and the stability of the light source.
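As a rough plausibility check of the 5-bit figure, one can use the textbook result that a uniform quantizer with step $\Delta$ has an error standard deviation of $\Delta/\sqrt{12}$. This is a simplified stand-in, not the fixed-point multiplication model used for Fig. 5d, and the function names below are assumptions introduced for illustration.

```python
import math


def quantization_std(bits):
    """Error standard deviation of a uniform quantizer on [0, 1] with 2**bits levels
    (step = 2**-bits), using the standard step / sqrt(12) approximation."""
    return 2.0 ** (-bits) / math.sqrt(12.0)


def effective_bits(std):
    """Invert quantization_std: number of bits whose quantization error matches `std`."""
    return -math.log2(std * math.sqrt(12.0))


if __name__ == "__main__":
    print(f"{quantization_std(5):.4f}")   # error std for 5 bits
    print(f"{effective_bits(0.008):.2f}") # bits equivalent to the measured std
```

Under this simplified model, the measured standard deviation of 0.008 corresponds to slightly more than 5 bits, in line with the resolution read off from Fig. 5d.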

When comparing optical architectures with digital electronics, it is helpful to use compute density (defined as MACs per second normalized by the area<sup>48</sup>) as a figure of merit for performance. This helps to directly compare the processing throughput of architectures that may employ very different schemes for computing MVM operations. For the SiN devices demonstrated here, the area of a single MAC unit cell is 285 μm × 354 μm. Operated at 14 GHz with 4 input vectors via WDM, this corresponds to a compute density of 555 GMACs/mm<sup>2</sup>. This is approximately a factor of 4 greater than Google's recently developed custom tensor processing ASIC<sup>6</sup> (operating at 8-bit precision), which has a compute density of 150 GMACs/mm<sup>2</sup>. We note that by moving to a silicon-on-insulator platform with a nominal bend radius of 5 μm and using integrated electrical control of the GST<sup>49,50</sup>, it would be straightforward to reduce the area of the MAC unit cell to less than $30 \times 30\ \mu\text{m}^2$, increasing the compute density to 15.6 TMACs/mm<sup>2</sup> per input channel. This is a $\sim 100 \times$ improvement over digital implementations and scales linearly with the number of input vectors via WDM, a notably different computing paradigm compared to electronic approaches.
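The compute density figures above can be reproduced from the stated unit cell dimensions. The sketch below only restates that arithmetic; the function name is an assumption, and the 150 GMACs/mm² reference value is the one quoted in the text.

```python
def compute_density(rate_hz, n_vectors, cell_w_um, cell_h_um):
    """MACs per second per mm^2 for one MAC unit cell of the given size."""
    area_mm2 = cell_w_um * cell_h_um * 1e-6  # 1 um^2 = 1e-6 mm^2
    return rate_hz * n_vectors / area_mm2


if __name__ == "__main__":
    tpu = 150e9  # reference value quoted in the text, MACs per second per mm^2
    sin = compute_density(14e9, 4, 285, 354)   # demonstrated SiN unit cell
    soi = compute_density(14e9, 1, 30, 30)     # projected SOI unit cell, per input channel
    print(f"SiN: {sin / 1e9:.0f} GMACs/mm^2, ratio {sin / tpu:.1f}")
    print(f"SOI: {soi / 1e12:.1f} TMACs/mm^2, ratio {soi / tpu:.0f}")
```

The ratios come out at roughly 3.7 and 104, consistent with the "factor of 4" and "~100×" statements.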

To estimate the full capabilities of the optical accelerator for convolution operations, the performance of common optical components in foundry services<sup>51,52</sup> must be considered in combination with the wavelength range of the frequency comb that can be used. The frequency comb clearly shows lines from 1500 nm to 1650 nm (see Fig. 5a), leading to a range of 150 nm that could be exploited for computation and that can even be extended by optimizing the setup. Considering the spacing of the comb lines of 100 GHz (0.8 nm), this leads to approximately $150\ \text{nm} / 0.8\ \text{nm} \approx 187$ different wavelengths. Decreasing the spacing to 50 GHz (0.4 nm) and increasing the matrix size to $50 \times 50$, the operational speed can reach an unprecedented 1 PMAC/s with a single matrix device, assuming a modulation and detection speed of 50 GHz.
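The order of magnitude of this projection can be checked with a short calculation. How the available comb lines are divided among input vectors is not spelled out in the text, so the sketch below makes an explicit assumption: each input vector of a 50-input matrix occupies 50 comb lines. The function names are hypothetical.

```python
def comb_lines(span_nm, spacing_nm):
    """Number of whole comb line spacings that fit in the wavelength span.
    The small epsilon guards against floating-point round-off in the division."""
    return int(span_nm / spacing_nm + 1e-9)


def projected_throughput(n, n_vectors, rate_hz):
    """MACs per second for an n x n matrix processing n_vectors in parallel."""
    return n * n * n_vectors * rate_hz


if __name__ == "__main__":
    print(comb_lines(150, 0.8))          # 187 lines at 100 GHz spacing
    lines = comb_lines(150, 0.4)         # 375 lines at 50 GHz spacing
    vectors = lines // 50                # assumption: 50 lines per input vector
    print(lines, vectors)                # 375 7
    print(projected_throughput(50, vectors, 50e9) / 1e15)  # 0.875 PMAC/s
    print(projected_throughput(50, 8, 50e9) / 1e15)        # 1.0 PMAC/s with 8 vectors
```

Under this assumption the 150 nm span gives about 0.9 PMAC/s, and a modest extension of the usable range to eight parallel vectors gives 1 PMAC/s, in line with the figure stated above.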

#### Conclusion

We describe the first instance of a photonic tensor core which combines in-memory computing with state-of-the-art photonic integrated frequency combs, enabling parallelized convolution operations in the same physical device. We demonstrate simultaneous data transfer and computing at speeds comparable to fiber networks. Prior optical approaches to computing have largely been limited by a lack of integrated non-volatile photonic memory and the lack of multiplexing capability for such calculations<sup>20,22,48</sup>. Our approach overcomes both these limitations by using non-volatile phase-change materials integrated on waveguides to locally store convolution kernels on-chip, and photonic chip-based frequency combs, which enable true in-memory photonic computing using WDM capability. The photonic tensor core demonstrated in this work is capable of operating at a speed of 2 TMAC/s, and promises operation faster by several orders of magnitude through moderate scaling with state-of-the-art foundry processes. A key feature of this is that, because the convolution operation is a passive transmission measurement, the calculations can in theory be performed at the speed of light at very low power, experimentally limited only by the modulation and detection bandwidths. Making use of the wavelength division multiplexing capabilities inherent to all-optical systems, our fast and parallelized implementation promises higher computational bandwidths when compared to electronic devices, as several pixels or even complete images can potentially be processed in a single time step. Our approach for convolution processing provides an effective method to remove the computing bottleneck in machine learning hardware for applications ranging from live video processing to autonomous driving and AI-aided life-saving applications. More importantly, such an approach more broadly suggests that integrated photonics is coming of age and in some cases can begin to match and even challenge electronic computation.

#### References

- 1. Ben-Nun, T. & Hoefler, T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. *ACM Comput. Surv.* **52**, (2019).
- 2. Amazon AWS Machine Learning. Available at: https://aws.amazon.com/machine-learning/. (Accessed: 25th December 2019)
- 3. Google Cloud. Available at: https://cloud.google.com/products/machine-learning/. (Accessed: 25th December 2019)
- 4. Microsoft Azure. Available at: https://azure.microsoft.com/en-us/overview/machine-learning/. (Accessed: 25th December 2019)
- 5. Zhang, C. *et al.* Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. *ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA)* 161–170 (2015). doi:10.1145/2684746.2689060
- 6. Jouppi, N. P. *et al.* In-Datacenter Performance Analysis of a Tensor Processing Unit. *Proc. ISCA '17* (2017). doi:10.1145/3079856.3080212
- 7. Wang, P. S., Liu, Y., Guo, Y. X., Sun, C. Y. & Tong, X. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. *ACM Trans. Graph.* **36**, (2017).
- 8. Miller, D. A. B. Attojoule Optoelectronics for Low-Energy Information Processing and Communications. *J. Light. Technol.* **35**, 346–396 (2017).
- 9. Agrawal, S. R. *et al.* A Many-core architecture for in-memory data processing. *Proc. Annu. Int. Symp. Microarchitecture, MICRO* 245–258 (2017). doi:10.1145/3123939.3123985
- 10. Miller, D. A. B. Are optical transistors the logical next step? *Nat. Photonics* **4**, 3–5 (2010).
- 11. Ielmini, D. & Wong, H. S. P. In-memory computing with resistive switching devices. *Nat. Electron.* **1**, 333–343 (2018).
- 12. Le Gallo, M. *et al.* Mixed-precision in-memory computing. *Nat. Electron.* **1**, 246–253 (2018).
- 13. Boybat, I. *et al.* Neuromorphic computing with multi-memristive synapses. *Nat. Commun.* **9**, (2018).
- 14. Hu, M. *et al.* Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. *Proc. Des. Autom. Conf.* (2016). doi:10.1145/2897937.2898010
- 15. Gong, N. *et al.* Signal and noise extraction from analog memory elements for neuromorphic computing. *Nat. Commun.* **9**, (2018).
- 16. Boybat, I. *et al.* Stochastic weight updates in phase-change memory-based synapses and their influence on artificial neural networks. *PRIME 2017 13th Conf. PhD Res. Microelectron. Electron. Proc.* **2**, 13–16 (2017).
- 17. Yang, T. Y., Park, I. M., Kim, B. J. & Joo, Y. C. Atomic migration in molten and crystalline Ge<sub>2</sub>Sb<sub>2</sub>Te<sub>5</sub> under high electric field. *Appl. Phys. Lett.* **95**, (2009).
- 18. Koelmans, W. W. *et al.* Projected phase-change memory devices. *Nat. Commun.* **6**, (2015).
- 19. Kim, S. *et al.* A phase change memory cell with metallic surfactant layer as a resistance drift stabilizer. *2013 IEEE Int. Electron Devices Meet.* 762–765 (2013). doi:10.1109/IEDM.2013.6724727
- 20. Shen, Y. *et al.* Deep learning with coherent nanophotonic circuits. *Nat. Photonics* **11**, 441–446 (2017).
- 21. Tait, A. N. *et al.* Silicon Photonic Modulator Neuron. *Phys. Rev. Appl.* **11**, (2019).
- 22. Pérez, D. *et al.* Multipurpose silicon photonics signal processor core. *Nat. Commun.* **8**, (2017).
- 23. Galal, S. & Horowitz, M. Energy-efficient floating-point unit design. *IEEE Trans. Comput.* **60**, 913–922 (2011).
- 24. Bangari, V. *et al.* Digital Electronics and Analog Photonics for Convolutional Neural Networks (DEAP-CNNs). *IEEE J. Sel. Top. Quantum Electron.* **26**, (2020).
- 25. Herr, T. *et al.* Temporal solitons in optical microresonators. *Nat. Photonics* (2014). doi:10.1038/nphoton.2013.343
- 26. Herr, T., Gorodetsky, M. L. & Kippenberg, T. J. Dissipative Kerr Solitons in Optical Microresonators. *Nonlinear Opt. Cavity Dyn. From Microresonators to Fiber Lasers* 8083, 129–162 (2015).
- 27. Raja, A. S. *et al.* Electrically pumped photonic integrated soliton microcomb. *Nat. Commun.* (2019). doi:10.1038/s41467-019-08498-2
- 28. Stern, B., Ji, X., Okawachi, Y., Gaeta, A. L. & Lipson, M. Battery-operated integrated frequency comb generator. *Nature* (2018). doi:10.1038/s41586-018-0598-9
- 29. Jones, R. *et al.* Heterogeneously Integrated InP/Silicon Photonics: Fabricating fully functional transceivers. *IEEE Nanotechnol. Mag.* (2019). doi:10.1109/MNANO.2019.2891369
- 30. Marin-Palomo, P. *et al.* Microresonator-based solitons for massively parallel coherent optical communications. *Nature* **546**, 274–279 (2017).
- 31. Spencer, D. T. *et al.* An optical-frequency synthesizer using integrated photonics. *Nature* **557**, 81–85 (2018).
- 32. Riemensberger, J. *et al.* Massively parallel coherent laser ranging using soliton microcombs. 1–18 (2019).
- 33. Moss, D. J., Morandotti, R., Gaeta, A. L. & Lipson, M. New CMOS-compatible platforms based on silicon nitride and Hydex for nonlinear optics. *Nature Photonics* (2013). doi:10.1038/nphoton.2013.183
- 34. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. *Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.* 770–778 (2016). doi:10.1109/CVPR.2016.90
- 35. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. *3rd Int. Conf. Learn. Represent. ICLR 2015 Conf. Track Proc.* 1–14 (2015).
- 36. Gao, L., Chen, P. Y. & Yu, S. Demonstration of Convolution Kernel Operation on Resistive Cross-Point Array. *IEEE Electron Device Lett.* **37**, 870–873 (2016).
- 37. Shafiee, A. *et al.* ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. *Proc. 2016 43rd Int. Symp. Comput. Archit. ISCA 2016* 14–26 (2016). doi:10.1109/ISCA.2016.12
- 38. Li, X. *et al.* Fast and reliable storage using a 5 bit, nonvolatile photonic memory cell. **6**, (2019).
- 39. Ríos, C. *et al.* Integrated all-photonic non-volatile multi-level memory. *Nat. Photonics* **9**, 725–732 (2015).
- 40. Feldmann, J. *et al.* Calculating with light using a chip-scale all-optical abacus. *Nat. Commun.* **8**, (2017).
- 41. Ríos, C. *et al.* In-memory computing on a photonic platform. *Sci. Adv.* **5**, (2019).
- 42. Gehring, H. *et al.* Low-loss fiber-to-chip couplers with ultrawide optical bandwidth. *APL Photonics* **4**, 0–7 (2019).
- 43. Gehring, H., Eich, A., Schuck, C. & Pernice, W. H. P. Broadband out-of-plane coupling at visible wavelengths. *Opt. Lett.* **44**, 5089 (2019).
- 44. Gaeta, A. L., Lipson, M. & Kippenberg, T. J. Photonic-chip-based frequency combs. *Nature Photonics* **13**, (2019).
- 45. Pfeiffer, M. H. P. *et al.* Photonic Damascene Process for Integrated High-Q Microresonator Based Nonlinear Photonics. *Optica* **3**, 1–6 (2016).
- 46. Grother, P. J. & Hanaoka, K. K. NIST Special Database 19 Handprinted Forms and Characters Database. *Tech. Rep. Spec. Database 19* 1–30 (2016). doi:10.18434/T4H01C
- 47. Lu, Z. *et al.* Broadband silicon photonic directional coupler using asymmetric-waveguide based phase control. *Opt. Express* **23**, 941–947 (2015).
- 48. Nahmias, M. A. *et al.* Photonic Multiply-Accumulate Operations for Neural Networks. *IEEE J. Sel. Top. Quantum Electron.* (2019). doi:10.1109/jstqe.2019.2941485
- 49. Farmakidis, N. *et al.* Plasmonic nanogap enhanced phase change devices with dual electrical-optical functionality. *Sci. Adv.* **5**, 1–8 (2019).
- 50. Zhang, H. *et al.* Miniature Multilevel Optical Memristive Switch Using Phase Change Material. *ACS Photonics* **6**, 2205–2212 (2019).
- 51. Atabaki, A. H. *et al.* Integrating photonics with silicon nanoelectronics for the next generation of systems on a chip. *Nature* **556**, (2018).
- 52. Wang, X. & Liu, J. Emerging technologies in Si active photonics. *J. Semicond.* **39**, (2018).
- 53. Gehring, H., Blaicher, M., Hartmann, W. & Pernice, W. H. P. Python based open source design framework for integrated nanophotonic and superconducting circuitry with 2D-3D-hybrid integration. *OSA Contin.* **2**, 3091–3101 (2019).
- 54. Liu, J. *et al.* Ultralow-Power Chip-Based Soliton Microcombs for Photonic Integration. *Optica* **5**, (2019).
- 55. Guo, H. *et al.* Universal dynamics and deterministic switching of dissipative Kerr solitons in optical microresonators. *Nat. Phys.* (2017). doi:10.1038/nphys3893
- 56. Karpov, M. *et al.* Dynamics of soliton crystals in optical microresonators. *Nat. Phys.* (2019). doi:10.1038/s41567-019-0635-0

## Methods

#### **Device fabrication**

The photonic circuits used for the convolution experiments are fabricated using a three-step electron-beam lithography (EBL; Raith EBPG 5150) process on a silicon nitride (325 nm) on silicon oxide (3300 nm) on silicon wafer (Rogue Valley Microdevices). The complete circuit was designed using GDShelpers, a design framework for integrated circuitry<sup>53</sup>.

In the first lithography step, windows for the deposition of gold alignment markers are exposed in the positive-tone resist polymethylmethacrylate (PMMA). The resist is developed in 1:3 MIBK:isopropanol for 120 seconds, and a layer stack of 5 nm chromium, 120 nm gold and 5 nm chromium is evaporated via electron-beam physical vapour deposition (EBPVD). Sonicating the chip in acetone removes the PMMA, so that only the gold markers in the exposed positions remain.

The markers are used in the second step to align the photonic structures. After spin coating a 300 nm layer of the negative-tone e-beam resist AR-N 7520.12 and prebaking it for 60 seconds at 85°C, an etch mask is exposed in the resist. The photonic structures are developed in MF-319 for 75 seconds, and a post-development bake is performed at 85°C for 60 seconds. Using reactive ion etching with a CHF<sub>3</sub>/O<sub>2</sub> plasma, the mask of the photonic circuits is transferred into the sample. The silicon nitride layer is fully etched, leaving waveguides with a width of 1.2 µm and a height of 325 nm that are single-mode at telecom wavelengths. Subsequently, the remaining resist is removed in an oxygen plasma for 10 minutes.

In the third EBL step, windows for the deposition of the phase-change material are written, using the same markers for alignment as for the photonic structures. The same process as in the first EBL step is used. Finally, 10 nm of the phase-change material GST and 10 nm of indium tin oxide (ITO) are deposited on the sample. Both layers are deposited by RF sputtering with an argon plasma (5 mTorr pressure, 15 sccm Ar, 30 W RF power and a base pressure of 2×10<sup>-6</sup> Torr). The ITO serves as a protective film that prevents oxidation of the phase-change material. As in the marker deposition, the PMMA is lifted off by sonicating the sample in acetone, leaving the phase-change material only in the desired positions on the photonic circuitry. Prior to the experiments, the GST is crystallized on a hot plate at 220°C for approximately 10 minutes.

#### Measurement setup

The experimental setups used to perform the convolution experiments are shown in Supplementary Figures S1-S3. The individual wavelengths are generated using a frequency comb that is operated in the single-soliton state and separated using a fibre-based multiplexer. For the image processing experiments (Figs. 3 and 4), the wavelengths (input vectors) are modulated using variable optical attenuators based on micro-electro-mechanical systems (MEMS), whereas the fast modulation (Fig. 5) is performed with a 20 GHz electro-optic modulator (EOM). The input signal is coupled to the chip using 3D-printed broadband total-internal-reflection couplers capable of operating from the visible to the telecom wavelength regime.

In the multiplexed version of the experiment, which processes four vectors at the same time, the corresponding wavelengths are multiplexed before the matrix and demultiplexed after it, again using fibre-based multiplexers. The convolution results are read out using photodetectors (New Focus Model 2011). In the frequency response experiment (Fig. 5), a fast photodiode (12 GHz) is used.

#### Realization of high-Q Si<sub>3</sub>N<sub>4</sub> microresonators

The soliton microcombs used in our work are based on Si<sub>3</sub>N<sub>4</sub> microring resonators with a free spectral range (FSR) of 100 GHz, shown in Fig. 2b. The microresonators are fabricated using the photonic Damascene process<sup>45</sup>, which provides access to high Q-factors reaching 10<sup>7</sup> and enables four-wave-mixing-based nonlinear frequency conversion processes as well as the formation of dissipative Kerr soliton (DKS) states at low pump powers<sup>54</sup>.

The microresonators were designed to have cross-section dimensions of 0.82 × 1.50 µm, which ensure an anomalous group velocity dispersion (GVD) of about 1-2 MHz at around 1550 nm, as needed for Kerr comb generation and the formation of DKS states. The light is coupled evanescently to a microresonator via an on-chip bus waveguide with similar dimensions located close to the microring; the bus waveguide is additionally equipped with inverse tapers at its ends for edge coupling to the chip. The Si<sub>3</sub>N<sub>4</sub> chips are furthermore fiber-packaged, with an average loss of 4 dB per interface, to facilitate coupling light into and out of the system. The fabricated devices have Q-factors exceeding 5 × 10<sup>6</sup>, which allows for DKS generation and switching<sup>55,56</sup> even at relatively low input pump powers below 1 W.
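As a rough back-of-the-envelope check of the numbers quoted above, the following sketch converts a Q-factor into a resonance linewidth via Δν = ν/Q and a coupling loss in dB into a transmitted power fraction. The function names are illustrative, the 1550 nm carrier is taken from the text, and the text does not state whether the quoted Q is loaded or intrinsic, so the result should be read as an order-of-magnitude estimate rather than the authors' own analysis.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def resonance_linewidth_hz(q_factor, wavelength_m=1550e-9):
    """Full resonance linewidth from the definition Q = nu / delta_nu."""
    optical_frequency = C / wavelength_m
    return optical_frequency / q_factor


def db_loss_to_fraction(loss_db):
    """Fraction of optical power that survives a loss given in dB."""
    return 10 ** (-loss_db / 10)


if __name__ == "__main__":
    # Q = 5e6 and 4 dB per interface are the values quoted in the text.
    print(f"linewidth: {resonance_linewidth_hz(5e6) / 1e6:.1f} MHz")
    print(f"transmitted fraction per interface: {db_loss_to_fraction(4.0):.3f}")
```

Under these assumptions, a Q of 5 × 10<sup>6</sup> corresponds to a linewidth of roughly 39 MHz, and a 4 dB interface passes about 40 % of the incident power.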

#### Soliton comb generation

For the DKS generation, a Si<sub>3</sub>N<sub>4</sub> microring resonator is driven by a continuous-wave tunable fiber laser, which is amplified with an erbium-doped fiber amplifier (EDFA) to a power level of about 1 W. A high-power bandpass filter is used to suppress the amplified spontaneous emission (ASE) from the EDFA. The polarization of the light is adjusted using a fiber-based polarization controller to match the TE-polarized fundamental mode of the microresonator, and the light is then launched into the fiber-coupled Si<sub>3</sub>N<sub>4</sub> chip.

In order to launch the DKS state, a standard pump tuning technique is applied<sup>25</sup>, in which the amplified seed laser is swept over the chosen resonance from the blue-detuned side to the red-detuned side at a speed of approximately 200 GHz/s. This approach generates multiple-soliton states with several pulses inside the cavity, which, however, usually have a highly structured optical spectrum. In order to achieve the single DKS state with a spectrally smooth sech<sup>2</sup>-shaped envelope, the soliton switching procedure is employed<sup>55</sup>: the pump is slowly tuned toward shorter wavelengths until the single-soliton state is stabilized. In order to improve the long-term stability of the generated DKS states and to align the resulting optical frequency comb to the established International Telecommunication Union (ITU) grid, the Si<sub>3</sub>N<sub>4</sub> chip is thermally controlled. This enables the use of standard WDM equipment and stabilizes the comb against environmental temperature fluctuations and setup drifts, ensuring more than 8 hours of continuous operation.

The resulting DKS-based optical frequency comb, with 100 GHz line spacing and spanning multiple telecommunication bands, is coupled out from the chip. The residual pump is suppressed using a fiber-based notch filter, and a small portion of the light (1 %) is used for monitoring purposes. The rest of the comb, shown in Fig. 2e, is then additionally amplified with a C-band EDFA and used in the setup for encoding and demultiplexing the image vectors.
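To illustrate what aligning a 100 GHz comb to the ITU grid means in practice, the sketch below lists a few channels of the standard DWDM grid, which is anchored at 193.1 THz, together with their vacuum wavelengths. The function name and the choice of channels are illustrative assumptions; the text does not state which grid channels the comb lines in the experiment actually occupy.

```python
C = 299_792_458.0         # speed of light in vacuum, m/s
ITU_ANCHOR_HZ = 193.1e12  # anchor frequency of the ITU DWDM frequency grid


def itu_channels(n_lines, spacing_hz=100e9, first_index=0):
    """Return (frequency in THz, vacuum wavelength in nm) for consecutive grid channels.

    Channel n sits at ITU_ANCHOR_HZ + n * spacing_hz; negative indices are allowed.
    """
    channels = []
    for n in range(first_index, first_index + n_lines):
        frequency = ITU_ANCHOR_HZ + n * spacing_hz
        channels.append((frequency / 1e12, C / frequency * 1e9))
    return channels


if __name__ == "__main__":
    # Four neighbouring channels, e.g. one 4-entry input vector (illustrative choice).
    for f_thz, wl_nm in itu_channels(4):
        print(f"{f_thz:.1f} THz  {wl_nm:.2f} nm")
```

With 100 GHz spacing, neighbouring comb lines near 1550 nm are separated by about 0.8 nm, which is why off-the-shelf WDM multiplexers can separate them once the comb is locked to the grid.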



**Figure 1. Photonic in-memory computing using an on-chip frequency comb and phase-change materials. a)** A comparison of digital and analog electronic architectures with our photonic tensor core architecture. Digital electronics (left) requires many sequential processing steps distributed across multiple cores to compute convolution operations on an image, while an entire matrix-vector multiplication (MVM) can be performed in one step using analog electronic in-memory computing (center). Photonic in-memory computing brings wavelength multiplexing as an additional degree of freedom, enabling multiple MVM operations in a single time step. **b)** An $n \times n$ input image with $d_{in}$ channels is convolved with $d_{out}$ kernels of size $k \times k$ by mapping the convolution operations into a sequence of MVM operations. The input image is mapped to a series of $(n-k+1)^2$ input vectors of size $(d_{in} \times k^2) \times 1$, each of which is multiplied by a filter matrix of dimension $(d_{in} \times k^2) \times d_{out}$. Each comb line corresponds to one entry of the input vector and is modulated according to the pixel values of the input matrix. **c)** Conceptual illustration of a fully integrated photonic architecture to compute convolution operations. An on-chip laser (not used in this work) pumps an integrated SiN soliton microcomb to generate a broadband frequency comb. Individual comb teeth, which form the input vectors, are modulated at high speeds, multiplied with a matrix of non-volatile phase-change memory cells, and summed along each column on a photodetector.
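The mapping described in panel b can be sketched in a few lines of plain Python. This is a minimal illustration, assuming stride 1, no padding and the sliding-window (cross-correlation) convention common in neural networks; the function names `im2col`, `kernels_to_matrix`, `mvm` and `convolve` are hypothetical and not taken from the paper.

```python
def im2col(image, k):
    """Flatten every k x k patch of a multi-channel image into one input vector.

    image[c][y][x] holds d_in channels of an n x n image. The result contains
    (n - k + 1)**2 vectors, each of length d_in * k * k.
    """
    d_in, n = len(image), len(image[0])
    vectors = []
    for y in range(n - k + 1):
        for x in range(n - k + 1):
            vectors.append([image[c][y + dy][x + dx]
                            for c in range(d_in)
                            for dy in range(k)
                            for dx in range(k)])
    return vectors


def kernels_to_matrix(kernels):
    """Arrange d_out kernels (kernels[j][c][dy][dx]) as columns of a (d_in*k*k) x d_out matrix."""
    columns = [[w for channel in kern for row in channel for w in row]
               for kern in kernels]
    return [list(r) for r in zip(*columns)]  # transpose: rows = inputs, columns = kernels


def mvm(vector, matrix):
    """One matrix-vector multiplication: d_out sums, one per matrix column."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def convolve(image, kernels):
    """Convolution as a sequence of MVMs, one per image patch."""
    k = len(kernels[0][0])
    matrix = kernels_to_matrix(kernels)
    return [mvm(vec, matrix) for vec in im2col(image, k)]


if __name__ == "__main__":
    image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]           # d_in = 1, n = 3
    kernels = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]    # d_out = 2, k = 2
    print(convolve(image, kernels))
```

In the photonic picture, each vector returned by `im2col` corresponds to one set of modulated comb lines, the matrix to the programmed phase-change cells, and each sum in `mvm` to the signal on one output photodetector.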



**Figure 2. Concept of photonic tensor cores for convolution operations. a)** Basic matrix-vector multiplication: a vector is encoded in the amplitudes of individual comb teeth of a silicon nitride ($Si_3N_4$) photonic integrated soliton frequency comb (microcomb) at different wavelengths ("$X_1$" to "$X_i$") and sent to the corresponding matrix input waveguides. The matrix elements are inscribed in the state of phase-change material patches on the waveguides. The splitting ratios of the directional couplers are chosen such that the same fraction of the light from each input reaches the output. **b)** Scanning electron micrograph of a microresonator used for frequency comb generation. **c)** Optical micrograph of a fabricated 4×4 matrix with 3D-printed input and output couplers to enable broadband operation. The close-up SEM images on the right show the 3D-printed couplers (bottom) and the waveguide crossings with the PCM (top) in more detail. **d)** Sketch of the multiplexed all-optical matrix-vector multiplication. The input vectors are generated from lines of a photonic chip-scale dissipative Kerr soliton (DKS) frequency comb using a multiplexer (MUX) and variable optical attenuators (VOAs). The entries of different input vectors are grouped together, again employing wavelength multiplexing, and sent to the on-chip multiply-accumulate (MAC) unit that performs the calculations. After the correct wavelengths are combined with optical demultiplexers (DEMUX), the multiplication results are obtained. Note that in the given example four kernels and four input vectors are processed at once, resulting in 64 MAC operations per time step. **e)** Measured spectrum of a single-soliton frequency comb.



**Figure 3. Convolution using sequential MVM operations. a-e)** Experimental result of convolving a 128×128 pixel image showing a handwritten digit (a) with four image kernels of size 3×3 (corresponding to a 9×4 filter matrix). The kernels are chosen to highlight different edges of the input image. **f)** Convolution operation with a 3×3 image kernel (an emboss filter) without post-processing. The image on the left shows the original image, while the other two depict the experimental and the calculated (correct) result.



**Figure 4. Convolution using parallel MVM operations. a-e)** The original input images are shown on the left (a) and the output images using four different image kernels for highlighting edges are shown in (b)-(e). The size of the four image kernels is 2×2, corresponding to a 4×4 filter matrix. In each time step, four input vectors are processed simultaneously via wavelength division multiplexing, as illustrated in Fig. 2d.
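The operation counts implied by the sequential (Fig. 3) and parallel (Fig. 4) schemes can be tallied with the short sketch below. It assumes single-channel images (consistent with the 9×4 and 4×4 filter matrices quoted in the captions), stride 1 and no padding; the function names are illustrative.

```python
def macs_per_time_step(kernel_size, d_in, d_out, parallel_vectors=1):
    """MAC operations completed in one time step.

    The filter matrix has (d_in * k^2) x d_out entries, and every
    wavelength-multiplexed input vector uses each entry once.
    """
    return d_in * kernel_size ** 2 * d_out * parallel_vectors


def time_steps(image_size, kernel_size, parallel_vectors=1):
    """Time steps needed to slide a k x k kernel over an n x n image."""
    n_vectors = (image_size - kernel_size + 1) ** 2
    return -(-n_vectors // parallel_vectors)  # ceiling division


if __name__ == "__main__":
    # Sequential case of Fig. 3: 3x3 kernels, four kernels, one vector per step.
    print(macs_per_time_step(3, 1, 4), time_steps(128, 3))
    # Parallel case of Fig. 4: 2x2 kernels, four kernels, four vectors per step.
    print(macs_per_time_step(2, 1, 4, parallel_vectors=4))
```

For the parallel case this reproduces the 64 MAC operations per time step stated for the 4×4 matrix with four multiplexed input vectors, while the sequential 9×4 case performs 36 MAC operations per step over the 15,876 patches of a 128×128 image.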



**Figure 5. TeraMAC/s operation using the photonic tensor core. a)** Transmission spectrum of the coherent soliton frequency comb through the device. Inset: spectrum of the 16 selected comb lines used in the experiment. **b)** Modulation bandwidth of all comb lines, showing modulation rates up to 14 GHz. Inset: open eye diagram at 13.5 GHz. **c)** Calculation accuracy for 100,000 inner-product operations multiplying a vector of nine entries with a fixed matrix. Inset: Histogram of the data revealing a standard deviation of 0.008. **d)** Calculated expected standard deviation for fixed-point arithmetic with different resolutions. The green line indicates the experimentally obtained standard deviation; its intersection with the calculated curve gives the resolution achieved with photonic MVM operations (5 bit).
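One simple way to relate a measured standard deviation to an equivalent bit resolution, in the spirit of panel d, is the textbook model of an ideal uniform quantizer, whose rounding error has a standard deviation of step/√12. The sketch below uses that model with an assumed full-scale range of 1; the calculation behind the curve in the figure may differ in detail, so this is only a plausibility check.

```python
import math


def quantization_std(bits, full_scale=1.0):
    """Standard deviation of the rounding error of an ideal uniform quantizer.

    The error is uniformly distributed over one quantization step,
    which gives std = step / sqrt(12).
    """
    step = full_scale / 2 ** bits
    return step / math.sqrt(12)


def equivalent_bits(std, full_scale=1.0):
    """Resolution whose quantization noise equals the given standard deviation."""
    return math.log2(full_scale / (std * math.sqrt(12)))


if __name__ == "__main__":
    for bits in range(3, 9):
        print(bits, f"{quantization_std(bits):.4f}")
    # 0.008 is the standard deviation quoted in panel c.
    print(f"{equivalent_bits(0.008):.2f} bits")
```

Under this assumed model, a standard deviation of 0.008 on a unit range corresponds to roughly 5.2 bits, in line with the 5 bit resolution stated in the caption.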